ABSTRACT 

This investigation presents extensive Monte Carlo results on the 
robustness of confidence interval procedures, based on the F 
distribution, for the Kuder-Richardson 20 reliability coefficient when 
samples are small, and indicates the conditions under which the 
procedure can be used. A review of the literature is presented. The 
procedure uses a binary data matrix. Results indicate that the 
procedure is an extremely practical one. (CK) 
Introduction 

Educational researchers are generally aware of the fact that, unless the 
measurements used to draw inferences in the study are of sufficient reliability, 
these inferences may well be meaningless manifestations of random variation. Thus, 
for standardized tests normative information as presented in the test manual will 
be cited, whereas for instruments which the researcher has constructed, internal 
consistency reliability coefficients such as Cronbach's coefficient alpha (rα) 
(Cronbach, 1951) or its form when items are scored dichotomously, the 
Kuder-Richardson 20 (r20) (Kuder and Richardson, 1937), are frequently presented. 

Only rarely, however, do researchers concern themselves with the fact that an 
instrument does not have a single reliability but that this index is also a function 
of the population tested. We find, for example, that some standardized tests which 
are quite reliable when used for measuring middle-class children are virtually 
useless with Head Start populations. 

It occurs to this writer that similar phenomena may be operating in situations 
where deviations from standard teaching methods, or other variations in treatments 
used or populations sampled, may cause normative information supplied in a test 
manual to be wholly irrelevant. In educational experiments or quasi-experiments, 
then, it is the feeling of this writer that adequacy of test reliability should 
not be taken for granted but should be constantly checked and that this should be 
done separately for samples which differ on manipulated independent variables. 

Since, in many research studies, moderately small samples are being gathered, 
point estimates of a reliability index do not provide enough information to the 
researcher who is concerned about whether the instrument 1) is reliable enough 
for his purpose (if he has just constructed it), 2) is operating as reliably as 
reported in the manual (if it is a standardized test) or 3) exhibits consistency 
of reliability for different groups being tested. It is the feeling of this 
writer that the simple device of presenting confidence interval estimates 
of the reliability for each experimental group of subjects used in the study 
would be very useful data to include in the reporting of research results. If 
the argument developed above is logical, the next question to be addressed is: 
"What procedures should be recommended for confidence interval estimation of the 
reliability when samples are small?" 

Restricting the discussion to inferences about ρ20, the population value of 
the Kuder-Richardson 20 reliability coefficient, little information will be found 
on this topic in the literature. The commonly accepted procedure, which utilizes 
the F distribution, lacks empirical or analytic support when the samples are small. 
The only other procedures for making inferences about ρ20 which this writer has 
found were given by Payne and Anderson (1968). These investigators empirically 
derived an extensive set of tables for testing hypotheses about ρ20; unfortunately, 
they cannot be used for interval estimation. Thus, a study of the robustness of 
the confidence interval procedures for small samples appears to be the most 
reasonable first step in any attempt to solve this problem. It is the goal of this 
investigation to present fairly extensive Monte Carlo results which will indicate 
the conditions under which this procedure can be used. 

The Literature 

Feldt (1965, 1969) has presented derivations based on the two-factor random 
model of analysis of variance (ANOVA) which provide tests of hypotheses and 
confidence intervals in the one sample case and tests of hypotheses in two sample 
problems involving r20. In the first paper, Feldt clearly points out the problems 
which arise in using this model to describe dichotomous test item data. Assumptions 
which are obviously violated are those of normality, homoscedasticity of errors, 
and independence of the subject effects and errors. 

Another problem area is the fact that in common testing procedure a fixed 
test is used. Thus, the two-factor model is not strictly appropriate, the sampling 
being Type 1 (Lord, 1955), as Feldt has also pointed out. The application of these 
procedures to dichotomously scored, fixed test item data might then be considered 
suspect but, by and large, the impression obtained from the literature is that, 
because of the well-known robustness of ANOVA procedures, useful results can be 
obtained. 

Although Feldt did present some empirical results which were in general 
agreement with the theoretical predictions, they were very limited. Using data from a 
study by Baker (1962), Feldt obtained the distribution of 200 r20 values for samples 
of 15, 30 and 60 subjects. The empirical percentiles of the distribution of r20 
compared favorably with those derived from the F distribution. 

Until a recent article by Nitko and Feldt (1969), this writer could find no 
results which considered the effect of item difficulties on the distribution of 
r20. Nitko and Feldt, however, showed that the sampling distributions of r20 are 
similar for two different distributions of item difficulty and that this was true 
for five tests with ρ20's ranging from .55 to .86. For the thirteen item tests 
simulated, the item difficulty distributions were concentrated around .5 or 
spread evenly over the range .2 to .8. Although exhibiting the similarity of the 
two distributions, the results given in the Nitko and Feldt paper do not allow a 
straightforward comparison of the empirical results with those expected from the 
F distribution. When this is done, it can be seen that the lower empirical 
percentiles are slightly larger than those predicted from the F distribution for ρ20 
larger than .5. This means that there is a deficiency of small values of r20. In 
Table 1, which follows, are some comparisons of the empirical percentiles of r20 
presented by Nitko and Feldt and those expected on the basis of normal theory. 



Table 1 About Here 



Lack of substantial evidence that the F distribution provides a useful model for 
estimation of ρ20 with moderate sized samples caused the present writer to 
undertake the research presented in this paper. In light of the distributional problems 
confronted in attempting an analytic solution in the small sample case, a Monte 
Carlo investigation was undertaken. 

Description of the Tests Simulated 

One of the ways that tests typically vary, and therefore a useful parameter 
to consider in a simulation study, is the distribution of item difficulty. In the 
study presented here, the following three distributions were considered: homogeneous 
with difficulty parameters from .3 to .7; heterogeneous with difficulties from .1 
to .9; and homogeneous with difficulties ranging from .1 to .5. In the discussion 
to follow, these tests will be abbreviated as HOM, HET, and HARD, respectively. 
The actual difficulty indices used for ten item tests are given in Table 2. Twenty 
and thirty item tests were simulated by using two or three items at each difficulty 
level. 



Table 2 About Here 



In this study, ρ20 was not specified directly as a parameter. Instead, an approach 
which assumed that the binary response vector was obtained by partitioning a 
multidimensional space and applying this partition to a multivariate normal continuous 
vector of "latent variables" was used. This data generation model is consistent 
with the popular normal ogive scaling model described elsewhere (e.g., Lord and 
Novick, 1968, p. 365-373). Once the success proportions had been designated, the 
other quantity needed in this data generation scheme was the matrix of 
intercorrelations of the latent variables associated with the dichotomous item responses. 
Three matrices were used in the main body of the study and all three were patterned, 
i.e., all pairs of latent variables had the same intercorrelation. These constant 
correlations were taken to be .1, .3, and .6. The combination of the three 
correlation structures and three difficulty distributions led to nine test structures. 
These nine test structures were increased to 27 tests actually simulated by 
considering tests of 10, 20, and 30 items each, and the range of ρ20 for these 27 tests 
was .36 to [illegible]. Since the main concern was the distribution of r20 for small 
samples, data for 30 subjects were simulated throughout the study. In order to 
simulate some actual tests, additional runs were made with four tests described by 
Ross (1966), which ranged from 12 to 18 items in length. These tests, referred 
to as W, X, Y, and Z in the Ross paper, were simulated by using the item difficulties 
which were given and obtaining the item intercorrelation matrix from the vector of 
factor loadings of each item on the common factor. The ρ20 for each test was 
larger than .90 and the item difficulties were typically in the .3 to .7 range. 
The utilization of item parameters which characterized actual tests was felt to be 
important because of the difficulty in generalizing from the constant correlations 
used in the rest of the study. 
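
The generation scheme just described can be illustrated with a short sketch. It is 
not the program used in the study. It assumes that the constant latent correlation 
is produced by a single common factor with equal loadings (which gives a patterned 
correlation matrix of the kind described), and that an item is scored 1 when its 
latent variable falls below the normal quantile of its success proportion. The 
function name, argument names and the example difficulties are hypothetical; the 
difficulties are not the Table 2 values. 

```python
import random
from statistics import NormalDist


def simulate_binary_matrix(n_subjects, difficulties, rho, rng):
    """Return an n_subjects x k matrix of 0/1 item scores.

    Assumption (not stated in the paper): equal latent correlations rho are
    produced by one common factor with loading sqrt(rho) on every item.
    difficulties[j] is treated as the success proportion of item j.
    """
    std_normal = NormalDist()
    # Partition point for each item: the normal quantile of its success proportion.
    thresholds = [std_normal.inv_cdf(p) for p in difficulties]
    loading = rho ** 0.5
    unique = (1.0 - rho) ** 0.5
    data = []
    for _ in range(n_subjects):
        common = rng.gauss(0.0, 1.0)  # common factor value for this subject
        row = []
        for tau in thresholds:
            # Unit-variance latent variable; any two items correlate rho.
            latent = loading * common + unique * rng.gauss(0.0, 1.0)
            row.append(1 if latent < tau else 0)
        data.append(row)
    return data


# Hypothetical ten item test with success proportions spread from .1 to .9.
example_difficulties = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
example = simulate_binary_matrix(30, example_difficulties, 0.3, random.Random(1))
```

Passing the random number generator explicitly keeps each simulated data set 
reproducible, which is convenient when several hundred data sets are generated 
per test. 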

Procedures 

Let the assumption be made that a binary data matrix is available representing 
the responses of N subjects to k items. $MS_s$ and $MS_{is}$ will refer to the mean 
squares in the ANOVA corresponding to the subject and item by subject interactions. 
The quantity $F_{ob} = MS_s / MS_{is} = 1/(1 - r_{20})$ is readily computed. The 
population analogue of $F_{ob}$ will be referred to as $F_{pop}$ in accord with earlier 
notation of Feldt (1965) and is related to $\rho_{20}$ by $F_{pop} = 1/(1 - \rho_{20})$. 
The statistic used in the investigation was $V = F_{ob}/F_{pop}$. The computation of 
$F_{pop}$ was carried out by using the correlations in the latent variable correlation 
matrix and the item success proportions in a series expansion (Kendall and Stuart, 
1961) relating the correlation in the bivariate normal to its phi coefficient. 
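
As an illustration of these quantities, the sketch below computes the two mean 
squares, the observed F ratio and r20 from a small binary matrix, and shows that 
the F ratio agrees with 1/(1 - r20) when the item and total score variances are 
both computed with divisor N. The function names and the six by four score matrix 
are hypothetical and serve only as a worked example. 

```python
def mean_squares(data):
    """Subject and item-by-subject mean squares for an N x k score matrix."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data)
    correction = grand ** 2 / (n * k)
    ss_total = sum(x * x for row in data for x in row) - correction
    ss_subjects = sum(sum(row) ** 2 for row in data) / k - correction
    col_sums = [sum(row[j] for row in data) for j in range(k)]
    ss_items = sum(c * c for c in col_sums) / n - correction
    ss_inter = ss_total - ss_subjects - ss_items
    return ss_subjects / (n - 1), ss_inter / ((n - 1) * (k - 1))


def kr20(data):
    """Kuder-Richardson 20, with divisor N for the total score variance."""
    n, k = len(data), len(data[0])
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    p = [sum(row[j] for row in data) / n for j in range(k)]
    sum_pq = sum(pj * (1.0 - pj) for pj in p)
    return k / (k - 1) * (1.0 - sum_pq / var_total)


def v_statistic(data, rho20):
    """V = F_ob / F_pop for a supplied population value rho20."""
    ms_s, ms_is = mean_squares(data)
    f_ob = ms_s / ms_is
    f_pop = 1.0 / (1.0 - rho20)
    return f_ob / f_pop


# Hypothetical responses of six subjects to four items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
ms_s, ms_is = mean_squares(scores)
f_ob = ms_s / ms_is   # equals 1 / (1 - kr20(scores)) up to rounding
```

In the simulations the population value supplied to `v_statistic` would be the ρ20 
computed from the latent correlations and success proportions; that computation is 
not reproduced here. 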

If the two-factor random model is appropriate, Feldt has shown that V should 
be distributed according to the F distribution with N-1 and (N-1)(k-1) degrees of 
freedom. Thus, values of this statistic were cast into a frequency distribution 
with boundaries which were the deciles of the appropriate F distribution. In 
addition, 90% and 95% open-ended (lower) and closed confidence limits were obtained 
according to standard procedures derived by Feldt. 

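For clarification, consider the following probability statements which serve 
as the basis for the confidence intervals: 

(1) $P(C_{1L} < \rho_{20}) = 1 - \alpha$ 

(2) $P(C_{2L} < \rho_{20} < C_{2U}) = 1 - \alpha$ 

where $C_{1L} = 1 - (1 - r_{20}) F_{1-\alpha}$, $C_{2L} = 1 - (1 - r_{20}) F_{1-\alpha/2}$, 
and $C_{2U} = 1 - (1 - r_{20}) F_{\alpha/2}$, with $F_p$ denoting the 100p-th percentile 
of the F distribution with N-1 and (N-1)(k-1) degrees of freedom. 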
Note that (1) and (2) refer, respectively, to open-ended and closed confidence 
intervals which are often of interest for ρ20. For each new sample generated the 
three boundary points C1L, C2L, and C2U were computed for each of α=.10 and .05. 
Counters were advanced if any of the inequalities presented in probability 
statements (1) or (2) were violated. These frequency counts were later converted to 
sample proportions for comparison with the theoretical probabilities. In tables to 
follow, these three empirical proportions are denoted as E1, E2L, and E2U, 
respectively, and the sum of the last two is simply E2. One thousand data sets were 
generated for the ten item tests, 500 for the 20 and 30 item tests. For the four 
tests from the Ross study, which ranged from 12 to 18 items in length, 1000 data 
sets were generated. 
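
The counting step can be sketched as follows. Because V = (1 - ρ20)/(1 - r20), the 
event that V falls below an F percentile is the same as the event that ρ20 exceeds 
1 - (1 - r20) times that percentile, so each limit has the same form. The sketch is 
illustrative only: the function names are hypothetical, the F percentiles are passed 
in by the caller (the Python standard library has no F quantile function), and the 
numerical percentiles in the example are placeholders rather than table values. The 
labels E2U ("too high") and E2L ("too low") follow the footnote to Table 4. 

```python
def limit(r20, f_percentile):
    """Confidence limit 1 - (1 - r20) * F for a supplied F percentile."""
    return 1.0 - (1.0 - r20) * f_percentile


def tally_errors(r20_values, rho20, f_open, f_low, f_high):
    """Proportions of incorrect confidence statements over simulated samples.

    f_open : upper percentile used for the open-ended lower limit C1L
    f_low  : lower percentile used for the closed upper limit C2U
    f_high : upper percentile used for the closed lower limit C2L
    """
    counts = {"E1": 0, "E2U": 0, "E2L": 0}
    for r in r20_values:
        c1l = limit(r, f_open)
        c2l = limit(r, f_high)
        c2u = limit(r, f_low)
        if c1l > rho20:      # open interval misses the parameter
            counts["E1"] += 1
        if c2l > rho20:      # closed interval lies entirely too high
            counts["E2U"] += 1
        if c2u < rho20:      # closed interval lies entirely too low
            counts["E2L"] += 1
    n = len(r20_values)
    proportions = {name: count / n for name, count in counts.items()}
    proportions["E2"] = proportions["E2U"] + proportions["E2L"]
    return proportions


# Placeholder percentiles and r20 values, for illustration only.
result = tally_errors([0.40, 0.75, 0.95], rho20=0.70,
                      f_open=1.5, f_low=0.6, f_high=1.7)
```

In an actual run the three percentiles would be taken from the F distribution with 
N-1 and (N-1)(k-1) degrees of freedom for the chosen α. 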

The population ρ20's and the average r20 for the 500 or 1000 values generated 
are presented in Table 3 along with sample estimates of the skewness and kurtosis 
of the test score distributions. Summary statistics for the overall fit of the 
empirical and theoretical distributions are also given in Table 3 in terms of 
chi-square goodness of fit statistics. These were computed using the ten categories 
based on the deciles of the appropriate F distribution. 
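
A goodness of fit statistic of this kind can be sketched as below. The nine decile 
boundaries of the appropriate F distribution are assumed to be supplied by the 
caller; the function name and the boundaries in the example are hypothetical 
placeholders. With ten equally likely categories the expected count in each is one 
tenth of the number of samples, and the statistic is referred to the chi-square 
distribution with 9 degrees of freedom, whose 5% critical value is about 16.92. 

```python
from bisect import bisect_right


def decile_chi_square(v_values, decile_boundaries):
    """Category counts and chi-square statistic for values cut at the deciles.

    decile_boundaries must be the nine increasing decile points of the
    reference distribution, giving ten categories of equal probability.
    """
    counts = [0] * (len(decile_boundaries) + 1)
    for v in v_values:
        counts[bisect_right(decile_boundaries, v)] += 1
    expected = len(v_values) / len(counts)
    chi_square = sum((c - expected) ** 2 / expected for c in counts)
    return counts, chi_square


# Placeholder boundaries and values, chosen so every category holds five values.
placeholder_boundaries = [1, 2, 3, 4, 5, 6, 7, 8, 9]
placeholder_values = [i + 0.5 for i in range(10)] * 5
counts, chi_square = decile_chi_square(placeholder_values, placeholder_boundaries)
```

Returning the category counts alongside the statistic makes it easy to see which 
category contributes most to a large value, which is how the fit in the tails is 
examined later in the paper. 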



Table 3 about here 



It is the writer's opinion that although the results for short tests where 
the latent item intercorrelation is low (.10) are not of much practical interest, 
for the majority of the tests simulated the test parameters are similar to 
those obtained in practical testing situations in education. For example, the 
symmetric score distributions (HET and HOM) exhibit varying degrees of platykurtosis 
as is commonly found in achievement and aptitude test score distributions. 
Exceptions may be the HET test with ρ=.60, which is nearly rectangular, actually slightly 
U-shaped. Similarly, the skewed distribution for the HARD test with ρ=.6 is a 
rather severe J-shaped distribution which would be uncommon in most educational 
settings. 



* See footnote to Table 4 for interpretation of the proportions E1, E2L, and E2U. 


Again referring to Table 3, observe that the well-known negative bias of 
r20 as an estimator of ρ20 (e.g., see Lord and Novick, 1968) is not very serious. 
The average value of r20 for the samples generated is typically slightly smaller 
than the parameter when the parameter is small, but the bias becomes trivial when 
ρ20 is larger than about [illegible]. 

As regards the chi-square statistics reported in Table 3, the writer does not view 
their significance as particularly important, but they are presented to indicate in 
general how well the distribution of V approximates the F distribution. For nine 
of the 27 tests simulated, the goodness of fit statistic was significant at the 
5% level, indicating gross lack of fit of the empirical to the theoretical 
distribution. As the reader surely realizes, what is more important is the fit in the 
tails of the distributions, since this governs the adequacy of the inferential 
procedures. Comments concerning the chi-square results, then, will be included with the 
discussion on the accuracy of the confidence interval estimation, the results of 
which, for the main body of tests simulated, follow in Table 4. 



Table 4 about here 



RESULTS 

In evaluating the results of this investigation, it is useful to consider the 
sampling variation which can be expected when the binomial distribution is the 
appropriate model. Presented in the table below are the standard errors for a sample 
proportion from populations with π=.10 and .05 and based on samples of size 500 and 
1000. 

Brief table of standard errors of proportions 

                        Proportion 
                     .10             .05 

sample    500    .013  (.013)    .0098 (.010) 

size     1000    .0095 (.010)    .0069 (.007) 

The values in parentheses above are rounded versions which were used to 
determine intervals within which a resulting proportion might be expected to fall about 
68% of the time if the theoretical percentiles were correct. When the empirical 
proportions E1 and E2 in Table 4 are compared with these intervals, it is found 
that all of the HET tests for ρ=.1 are within the limits imposed. The more 
platykurtic HET tests with ρ=.3 and .6 have a rather large number of entries, actually 
19 of the 24, which are outside of these limits. The interesting fact is that all 
of these 19 empirical proportions are below the nominal values. Thus for these 
tests the nominal confidence coefficient tends to underestimate truth, e.g., more 
than 95% of the "95% confidence intervals" cover the true parameter. When percent 
relative error, defined as the absolute error divided by the nominal α, is considered, 
it is found to vary from 14% for .90 confidence coefficient, ρ=.3, open interval 
(E1 = .086) to 62% for .95 coefficient, ρ=.6 (an entry of .019). Naturally, percent 
relative errors are larger for 95% than 90% nominal intervals and, excluding the poorly 
behaved results for the 10 item HET test with ρ=.6, generally are below [illegible]. It 
is worthy of note that for each of the five HET tests with significant chi-square values, 
E2U exceeds E2L, indicating a shortage of low values of V (and r20). (There is one 
exception to this trend for ρ=[illegible], k=30, [illegible] confidence coefficient.) This 
fact is in keeping with the remarks made earlier in reference to Feldt's findings, which 
suggested that the lower percentiles of the empirical distributions were slightly 
larger than those for the comparison F distribution. In the four significant chi-square 
values for ρ=.3 and .6, the largest contribution to the chi-square is the contribution 
from the lowest category. There is no perceptible relation between test length and the 
adequacy of the estimation procedures. 
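
The two yardsticks used in this section, the binomial standard error of a sample 
proportion and the percent relative error, are simple enough to sketch directly. 
The function names below are hypothetical; the numerical inputs are the nominal 
values, sample sizes and the two empirical proportions quoted above. 

```python
import math


def proportion_standard_error(p, n):
    """Binomial standard error of a sample proportion."""
    return math.sqrt(p * (1.0 - p) / n)


def percent_relative_error(empirical, nominal):
    """Absolute error as a percentage of the nominal alpha."""
    return 100.0 * abs(empirical - nominal) / nominal


def outside_one_sigma(empirical, nominal, rounded_se):
    """True when the empirical proportion is beyond one standard error."""
    return abs(empirical - nominal) > rounded_se


# Standard errors for nominal .10 and .05 with 500 and 1000 data sets.
standard_errors = {
    (p, n): proportion_standard_error(p, n)
    for p in (0.10, 0.05)
    for n in (500, 1000)
}

# Relative errors for the two empirical proportions quoted in the text.
open_90 = percent_relative_error(0.086, 0.10)
coeff_95 = percent_relative_error(0.019, 0.05)
```

Rounded to the precision of the brief table, the four standard errors come to 
.013, .0095, .0098 and .0069, and the two relative errors come to 14% and 62%. 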

Moving to discuss the HOM tests, we find that some empirical proportions exceed 
the one sigma limits at all levels of item intercorrelation and deviate in both 
directions from the nominal values. Of the nine values of E1 which were 
"significant", seven exceeded the nominal value, indicating true confidence less than 
nominal for these open-ended intervals. With the exception of the .95 confidence 
interval for ρ=.1 and 30 items (E1 = .070), the other relative errors were 24% or 
less for these intervals, indicating, for example, that generally no fewer than 
94% of the 95% intervals generated covered the true parameter. Thus, although at 
variance with the conservatism associated with the estimation procedure for the 
HET tests, the results for open interval estimation for tests satisfying the HOM 
model still appear to have practical implications. 

For closed intervals, 8 of the 18 values of E2 were outside of the one sigma 
limits, and, contrary to the results for E1, seven of the eight yielded 
"significantly too many" intervals which covered the true parameter. The relative errors 
showed a definite increase as the item intercorrelation increased and were rather 
large (34% and 44%) for the test with ρ=.6. It will be recalled, however, that the 
score distribution for this test is virtually rectangular. Examination of the 
E2L and E2U entries indicates that where any differences between these two figures 
exist, E2U, which represents the proportion of times that the interval totally 
exceeds the true parameter, is usually larger than E2L. This is reasonable from 
the results for E1 and indicates that the empirical distributions tend to have too 
much density in the upper tail and too little in the lower tail. 

For the HARD tests, the results for open intervals are similar to those for the 
HOM tests in that all 13 "significant" values of E1 were larger than the nominal 
values. However, whereas for the HOM tests the relative errors were usually 
smaller than 20%, for the HARD test they range up to 68% for nominal .95 
coefficient, 30 item test with ρ=.3 (E1=.084). The extreme J-shaped score distribution 
for ρ=.6 provides too few "proper" intervals for each of the six combinations of 
confidence coefficient and number of items, and the relative errors appear to 
increase with the length of the test. Closed intervals for the HARD test are 
somewhat better behaved for the more practical situations of ρ=.1 and .3. The largest 
relative error among the six values which were more than one sigma from the 
nominal value was 34%, which occurred for the same simulation as the 68% for the 
open interval. As a matter of fact, the .084 proportion of overestimates of ρ20 
combined with precisely the correct number of underestimates (.050) to yield 
E2 = .134. For ρ=.6, the relative errors are rather large (22% to 58%) and 
reflect too few intervals covering the true value. The primary reason is excessive 
values of E2U. [Illegible] the V distribution appears to 
fit the F distribution quite well. The significant chi-square values of about 17 for 
the 30 item HARD tests with ρ=.3 and ρ=.6 are primarily due to the excess of 
observations in the top category; in each case, the contribution from those categories 
provided the largest contribution to chi-square. 

The combined results of the four tests from the Ross paper follow in Table 5. 

Table 5 about here 

We observe that tests W and X are quite homogeneous with respect to difficulty, 
the σ values being much smaller than for the HOM and HARD tests simulated. 
Test Z, on the other hand, has an item difficulty spread similar to these two test 
models. The average latent item intercorrelation for all four tests is larger than 
.6 and the largest value of .76 characterized test X. The strong inter-item 
associations cause all ρ20's to be above .90. The test distributions for these four 
are interesting and will be related to the simulated tests already discussed. The 
easiest one to compare is test Y, which is similar in form to the ρ=.60 HOM test 
except that it is slightly more platykurtic (in this case more U-shaped). When a 
comparison is made against the results for the 10 item test in this cell, they 
are found to be very similar. Open intervals do not cover the parameter as often as 
the nominal coefficient advertised, while a shortage of entries in the lower tail 
caused closed interval construction procedures to be on the conservative side. 

The remaining three tests are moderately negatively skewed. In terms of 
skewness and kurtosis, tests W and Z appear similar, but the score distribution for 
test Z is somewhat more rectangular. Neither of these two distributions has an 
interior mode. 

The results for test Z follow the same general lines as those for the 20 item 
HARD test with ρ=.6, i.e., E1 values are a little too large and E2L too small. It 
would seem as though the corresponding 10 item test would be useful for comparing 
to test W, but it soon becomes evident that test W along with test X yield the 
strongest negative findings in the study. Relative errors of as much as 100% 
(actually slightly larger) exist for these two tests. Although quantitatively much 
more deviant, the results follow the general trend of the highly correlated HOM and 
HARD tests, namely that there are too many values in the upper tail of the V 
distribution and too few in the lower. The test X score distribution is U-shaped and 
very extreme. 

It appears as though the real tests selected for simulation may not have been 
particularly well chosen. The rationale for selecting these was one of easy 
availability: the information necessary for the generation scheme utilized was readily 
available. Upon looking at the input values for the four Ross tests, the only 
parameter which varied between tests W and Z was that of difficulty distribution. 
Because of this the writer decided to make simulations for tests with all items 
of the same difficulty. These runs were made simulating the ten item tests with 
ρ=.6 and yielded the results given in Table 6 below. 

For π=.5 the test score distribution is U-shaped, similar to test Y from the 
Ross paper and the HOM test for ρ=.3. Relative errors for open intervals are 
around 30%. The second test simulated, with π=.3, generated a score distribution 
similar to that of test W and the confidence interval results for these two tests 
are very similar. Open intervals have relative errors approaching 100% and the 
overpopulated upper tail caused the closed intervals to be in error between [illegible] 
and [illegible] more often than the nominal coefficient would suggest. The situation 
becomes much worse for the very difficult test with π=.1. The test score distribution 
is extreme, however, with [illegible] of the "total scores" generated being zero. 

DISCUSSION 

The writer set himself to the task of determining the extent to which interval 
estimation of ρ20 by standard procedures based on the F distribution could be 
relied on for moderate sized samples (N=30). Results for tests with items spread 
evenly over a wide range of difficulty and which, therefore, resulted in a symmetric 
test score distribution were in good agreement with "nominal" results. For tests 
in which items were strongly associated, test score distributions were platykurtic 
and the nominal confidence coefficient typically underestimated the true proportion 
of correct statements. Most statisticians find this conservative approach at 
least tolerable. When the items were spread over a narrower range of difficulty, 
but were still centered at .5, there was a tendency for too few open intervals 
to be 'correct'. The relative errors, however, were small, generally less than 24%. 
For closed intervals the conservative nature of the HET tests reappeared. Results 
for test Y from the Ross paper and the tests simulated with π=.5, both of which had 
symmetric score distributions, were in agreement with these results. Therefore, 
when the test score distribution was symmetric, the most serious results were in 
the direction of conservative procedures. The fact that fewer than the nominal % 
of the open intervals covered the true parameter for the highly associated HOM 
tests does not seem too serious in that the % error was generally small. 

In the situations simulated where the score distribution was skewed, it was 
virtually always true that too few open intervals covered the true parameter and % 
errors ranged up to and sometimes exceeded 100%. The most severely skewed score 
distributions, with no interior mode, occurred for the HARD tests with ρ=.6, three 
of the four Ross tests and the tests with constant difficulty parameters of .3 and 
.1. A somewhat conservative rule which could be used for open intervals in these 
cases would be to use the 97.5th percentile of the F distribution to construct 
open 95% intervals. The only situation where such an adjustment procedure would 
not be either conservative or within reason was the π=.1 (constant), ρ=.6 test 
for which the score distribution was almost singular. For closed intervals in the 
case of a skewed score distribution, it is difficult to make any recommendation 
based on the data at hand, unless the items are only moderately associated (ρ<.3). 
If this is the case, then the standard procedure will yield relative errors probably 
smaller than about [illegible]. If the items are more strongly associated, however, the 
resulting score distributions become severely skewed and, while the upper tail of 
the V distribution is too populous, the situation in the lower tail is less 
predictable: for the three Ross tests the lower tail is too "light", for the π=.3 
distribution it is about right and for the π=.1 distribution the procedure falls 
completely apart. Before summarizing, let us enumerate specifics of this 
investigation which necessarily limit the generalizations. They are: 

1. Sample data from thirty respondents were simulated. 

2. The number of items ranged from 10 to 30. 

3. The normal ogive item characteristic curve related the trait being measured 
to the probability of a correct response. 

4. Latent responses were sampled from a multivariate normal distribution. 

5. Tests simulated had a "single factor" structure. 

6. For the main body of results, latent item intercorrelations were constant. 

7. Only 90% and 95% intervals were considered. 

If a researcher has test data which have these characteristics, he may wish to 
consider the following recommendations: 

1. For tests with item difficulty distributions which are widely spread 
about a median of .5, use the procedure but realize that it will tend to 
be conservative. 

2. For tests with item difficulty distributions which are more homogeneous 
about a median difficulty of .5, use the procedure realizing that there 
will be a slight tendency for "too few" open intervals to cover the true 
parameter if the items are strongly associated. 

3. For extremely skewed test score distributions the safe recommendation is 
to construct open intervals using the 97.5th percentile of the F 
distribution for nominal 95% intervals. The procedure will tend to be 
conservative. 

4. For mildly skewed test score distributions no blanket recommendation is 
possible based on the data. However, if item intercorrelations are modest 
so that the resulting score distribution has an interior mode and [illegible] 
is no more than about .4 or .5, the data suggest that the standard 
procedure will lead to relative errors of no more than 20% to 30%. 

It is not surprising that in situations where the item difficulty is fairly 
homogeneous and different from .5 and the items are highly related, the usual 
robustness of the F distribution is not sufficient to provide serviceable inferences. 
(The reader is referred to Mandeville (1969) for an extensive investigation related 
to hypothesis testing in repeated measures designs where the repeated measure is 
binary.) However, in most of these cases where the approximation did not prove 
useful, true ρ20 was rather large (greater than .80). In situations where true ρ20 
was less than .80, the parametric procedure did provide useful results. Since the 
concern of a researcher for the reliability of his measurements is usually inversely 
related to ρ20, the practical value of these results appears great. 

TABLE 1 

A Partial Comparison of Theoretical and Empirical 5th and 10th 
Percentiles of r20 Distributions Reported by Nitko and Feldt (1969) 

ρ20            Theoretical      Empirical        Theoretical      Empirical 
(I†, II)       5th Percentile   5th Percentile   10th Percentile  10th Percentile 
                                (I, II)                           (I, II) 

.558           .356             .352, .364       .4?8             .411, .419 
.690           .551             .559, .561       .586             .5?4, .594 
.770, .771     .666             .671             .693             .700, .700 
.825, .826     .746             .755, .753       [illegible]      .773, .772 
.864, .865     .804             .810, .80?       .820             .824 

Where a cell lists a single value, only one of the two entries is legible; a 
"?" marks an illegible digit. 

* An average of the two ρ20 entries in a row was used in the computations 
to obtain the theoretical percentile. 

† I and II refer to Nitko and Feldt's "Concentrated" and "Spread out" item 
difficulty distributions, respectively. 


TABLE 2 



Item Difficulties (π_i) for Three Ten Item Tests Simulated, 
Average Difficulties and Standard Deviations of the 
Item Difficulty Distributions 

TABLE 4 



Empirical Probabilities of Incorrect Confidence Statements 
for Open-Ended and Closed Confidence Intervals on ρ20 

* Decimals omitted in body 

† A reversal of the implication of statements on page 5 has been made for mnemonic 
reasons, so that E2U is the proportion of times that the total interval was "too 
high", i.e., C2L > ρ20. Similarly, E2L indicates the proportion of times that the 
interval was "too low", i.e., C2U < ρ20. 